In [1]:
# Display every expression's output in a cell, not just the last one
get_ipython().ast_node_interactivity = 'all'

Coding Project: Final Draft¶

Importing the Data¶

1. In the code cell below, read in your data set¶

In [2]:
# Enter your code here
import pandas as pd
import numpy as np
df = pd.read_csv("census.csv")
df
Out[2]:
state county fips population median_age pct_over_65 median_income per_capita_income pct_white pct_black pct_asian pct_hispanic pct_college unemployment pop_density is_metro region
0 Alabama Autauga County 1001 55380 38.2 15.0 58731 29819 76.8 19.0 1.0 2.8 26.6 3.5 91.8 1.0 Southeast
1 Alabama Baldwin County 1003 212830 43.0 20.0 58320 32626 86.2 9.3 0.9 4.6 31.9 4.0 114.6 1.0 Southeast
2 Alabama Barbour County 1005 25361 40.4 18.6 32525 18473 46.8 47.6 0.5 4.4 11.6 9.4 31.0 0.0 Southeast
3 Alabama Bibb County 1007 22493 40.9 15.9 47542 20778 76.8 22.3 0.1 2.6 10.4 7.0 36.8 1.0 Southeast
4 Alabama Blount County 1009 57681 40.7 17.9 49358 24747 95.5 1.6 0.4 9.3 13.1 3.1 88.9 1.0 Southeast
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3137 Wyoming Sweetwater County 56037 43521 35.3 11.4 74843 32603 93.4 1.2 0.8 15.9 22.5 5.7 4.2 0.0 West
3138 Wyoming Teton County 56039 23280 39.3 14.0 84678 54051 89.3 1.2 1.3 15.0 57.0 0.7 5.3 0.0 West
3139 Wyoming Uinta County 56041 20479 35.8 13.0 63403 28159 93.4 0.1 0.2 9.1 16.0 5.5 10.1 0.0 West
3140 Wyoming Washakie County 56043 8027 42.9 21.1 54158 28101 89.7 0.0 0.0 14.2 23.4 4.1 3.8 0.0 West
3141 Wyoming Weston County 56045 7049 43.1 19.4 57031 28531 97.4 0.2 0.8 1.1 20.0 4.0 3.0 0.0 West

3142 rows × 17 columns
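After loading, it is worth confirming the dtypes pandas inferred, since a mis-read column (e.g., `is_metro` arriving as a float even though it is really a binary flag) is exactly the situation the feature-engineering step below mentions. A minimal sketch with a toy CSV (the columns here are illustrative, not the full census.csv):

```python
import io
import pandas as pd

# Toy stand-in for a few census.csv columns
csv = io.StringIO("state,fips,population,is_metro\nAlabama,1001,55380,1.0\n")
toy = pd.read_csv(csv)

# is_metro is inferred as float64 even though it is really a 0/1 flag
print(toy.dtypes.to_dict())
```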

2. Run the command to get a scatterplot matrix (pandas.plotting.scatter_matrix()). Note: if you have a large number of columns in your dataset, you should select the 5-10 relevant columns to plot.¶

In [3]:
# Enter your code here
pd.plotting.scatter_matrix(df[['median_income', 'median_age', 'pct_college', 'population', 'unemployment', 'is_metro']])  # 'region' is non-numeric, so scatter_matrix would ignore it
Out[3]:
[6×6 array of matplotlib Axes objects, one panel per pair of the six numeric variables]
[Figure: scatter matrix of median_income, median_age, pct_college, population, unemployment, and is_metro]

Feature Engineering¶

3. Do some feature engineering with your data...I'd like you to do at least 2 of the following:¶

- Fix the data type if any of your variables were read in incorrectly (e.g., a categorical variable is being read in as an integer/float or a date being read in as an object)¶
- Transform variables as needed (e.g., create a squared term if there is curvature in a scatterplot, log a variable that is positive and skewed)¶
- Create an interaction term if you believe it makes sense to combine 2 variables¶
- Create a categorical variable from a numeric variable (e.g., create True/False categories or use numpy.select() to create bins)¶

Add as many code cells as needed and write a few sentences below to explain why you're making these changes¶
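As one concrete illustration of the `numpy.select()` binning option above (toy values, not from the census data):

```python
import numpy as np
import pandas as pd

# Bin a numeric column into labeled categories
ages = pd.Series([28.0, 41.5, 67.2])
conditions = [ages < 35, ages < 55]   # checked in order; first match wins
choices = ["young", "middle"]
bins = np.select(conditions, choices, default="older")
print(list(bins))  # ['young', 'middle', 'older']
```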

I will create a categorical variable to identify affluent towns. This will be defined as towns where the median income is in the top quartile.

I will create a column that shows the percentage of the population that is non-white. NGF research says this segment of the population is currently driving the growth of golf, so it could be helpful to keep track of this percentage directly.

Lastly, I will create a column that takes a combination of income, education, diversity, and population and assigns a weight to each. This composite column will represent the opportunity score for golf marketing efforts.

In [4]:
df["affluent"] = df["median_income"] > df["median_income"].quantile(0.75)  # top-quartile cutoff (about $59,867 in this data)
df.head()
Out[4]:
state county fips population median_age pct_over_65 median_income per_capita_income pct_white pct_black pct_asian pct_hispanic pct_college unemployment pop_density is_metro region affluent
0 Alabama Autauga County 1001 55380 38.2 15.0 58731 29819 76.8 19.0 1.0 2.8 26.6 3.5 91.8 1.0 Southeast False
1 Alabama Baldwin County 1003 212830 43.0 20.0 58320 32626 86.2 9.3 0.9 4.6 31.9 4.0 114.6 1.0 Southeast False
2 Alabama Barbour County 1005 25361 40.4 18.6 32525 18473 46.8 47.6 0.5 4.4 11.6 9.4 31.0 0.0 Southeast False
3 Alabama Bibb County 1007 22493 40.9 15.9 47542 20778 76.8 22.3 0.1 2.6 10.4 7.0 36.8 1.0 Southeast False
4 Alabama Blount County 1009 57681 40.7 17.9 49358 24747 95.5 1.6 0.4 9.3 13.1 3.1 88.9 1.0 Southeast False
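A quick sanity check on a quartile flag like this (toy numbers, not the census data): with `>` against the 75th percentile, only the rows strictly above the cutoff come out True.

```python
import pandas as pd

toy = pd.DataFrame({"median_income": [40000, 50000, 60000, 80000, 100000]})
cutoff = toy["median_income"].quantile(0.75)   # 80000.0 for these values
toy["affluent"] = toy["median_income"] > cutoff
print(cutoff, toy["affluent"].mean())
```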
In [5]:
df["diversity"] = 100 - df["pct_white"]

# Normalize each component to 0-1
income_score = (df['median_income'] / df['median_income'].max()).clip(0, 1)
education_score = (df['pct_college'] / df['pct_college'].max()).clip(0, 1)
diversity_score = (df['diversity'] / df['diversity'].max()).clip(0, 1)
size_score = (np.log1p(df['population']) / np.log1p(df['population'].max())).clip(0, 1)

# Younger counties score higher; age_score maxes out once median age drops to 30
age_score = (50 - df['median_age']).clip(0, 20) / 20

# Weighted sum
df['opportunity_score'] = (
    income_score * 35 +
    education_score * 25 +
    diversity_score * 15 +
    size_score * 15 +
    age_score * 10
).round(1)

df.head()
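To make the weighting concrete, here is the same normalize-and-weight scheme run on two toy rows (values invented): a county that tops every component scores exactly 100.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "median_income": [40000, 80000],
    "pct_college":   [10.0, 50.0],
    "diversity":     [5.0, 50.0],
    "population":    [1000, 1000000],
    "median_age":    [45.0, 30.0],
})

income    = (toy["median_income"] / toy["median_income"].max()).clip(0, 1)
education = (toy["pct_college"] / toy["pct_college"].max()).clip(0, 1)
diversity = (toy["diversity"] / toy["diversity"].max()).clip(0, 1)
size      = (np.log1p(toy["population"]) / np.log1p(toy["population"].max())).clip(0, 1)
age       = (50 - toy["median_age"]).clip(0, 20) / 20

score = (income * 35 + education * 25 + diversity * 15
         + size * 15 + age * 10).round(1)
print(score.tolist())  # [34.0, 100.0] -- the second row maxes every component
```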

Inferential Modeling¶

4. Build a regression model (either OLS, Logistic, or Poisson) explaining the relationships between 1 dependent variable of interest and 2 or more independent variables...be sure to build the appropriate model type based on your dependent variable¶

In [6]:
import statsmodels.formula.api as smf 
results = smf.ols("median_income ~ median_age + pct_college + population + unemployment + diversity", data = df).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          median_income   R-squared:                       0.556
Model:                            OLS   Adj. R-squared:                  0.555
Method:                 Least Squares   F-statistic:                     785.8
Date:                Sun, 22 Feb 2026   Prob (F-statistic):               0.00
Time:                        17:11:31   Log-Likelihood:                -33221.
No. Observations:                3142   AIC:                         6.645e+04
Df Residuals:                    3136   BIC:                         6.649e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     4.686e+04   1657.753     28.268      0.000    4.36e+04    5.01e+04
median_age    -170.9951     33.829     -5.055      0.000    -237.323    -104.667
pct_college    905.3288     20.031     45.197      0.000     866.054     944.603
population       0.0030      0.001      5.505      0.000       0.002       0.004
unemployment -1134.7679     78.488    -14.458      0.000   -1288.661    -980.875
diversity      -52.6311     12.394     -4.247      0.000     -76.932     -28.330
==============================================================================
Omnibus:                      310.165   Durbin-Watson:                   1.569
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              873.272
Skew:                           0.535   Prob(JB):                    2.35e-190
Kurtosis:                       5.351   Cond. No.                     3.41e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.41e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

5. Based on your model output, interpret the coefficients for at least 2 statistically significant variables, making sure to quantify the relationships between those variables and your dependent variable. If you don't have 2 statistically significant variables in your model, provide an explanation of what you expected would be significant and any theories as to why it might not be significant in your model¶

pct_college = 905.33: holding the other variables constant, each 1 percentage point increase in the college-educated share of the population is associated with roughly $905 higher median income. unemployment = -1134.77: each 1 percentage point increase in the unemployment rate is associated with roughly $1,135 lower median income.

6. Write a few sentences about your conclusions from the model. What was interesting or surprising about your results? Does your model confirm any hypotheses you might have had about your data? Does it change any of your hypotheses?¶

The relationship between education and income makes sense: people with college degrees tend to hold higher-paying jobs than those without. The negative relationship between unemployment and median_income is also intuitive. I had expected median_age to have a positive, significant relationship with income, but the model shows a negative, significant one. My mistake was probably not accounting for the older, retired segment of the population living on reduced incomes. In reality, median income likely peaks when median age is somewhere below retirement age, while the population is near its maximum earning potential.
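One way to test that peaked-income hypothesis is to add a squared age term to the regression. The sketch below runs on simulated data (the real version would use the df from census.csv); a significant negative coefficient on the squared term indicates an interior peak.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated counties where income truly peaks near a median age of 45
rng = np.random.default_rng(0)
n = 500
sim = pd.DataFrame({
    "median_age": rng.uniform(25, 60, n),
    "pct_college": rng.uniform(5, 60, n),
    "unemployment": rng.uniform(2, 12, n),
})
sim["median_income"] = (60000 - 80 * (sim["median_age"] - 45) ** 2
                        + 900 * sim["pct_college"]
                        - 1100 * sim["unemployment"]
                        + rng.normal(0, 3000, n))
sim["age_sq"] = sim["median_age"] ** 2

quad = smf.ols("median_income ~ median_age + age_sq + pct_college + unemployment",
               data=sim).fit()

# Negative age_sq coefficient -> income peaks at -b_age / (2 * b_age_sq)
peak_age = -quad.params["median_age"] / (2 * quad.params["age_sq"])
print(round(peak_age, 1))  # close to 45 for this simulation
```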

Predictive Modeling¶

7. Pick 1 variable that you'd like to make predictions with. Create an outcome variable for this data¶

In [7]:
# Define thresholds
high_affluence = df['affluent'] == True
high_diversity = df['diversity'] >= df['diversity'].median()  # or pick a threshold like 30

# Count counties that are both
both = (high_affluence & high_diversity).sum()
pct_both = (both / len(df) * 100).round(1)

print(f"Counties that are both affluent AND high diversity: {both} ({pct_both}%)")
Counties that are both affluent AND high diversity: 426 (13.6%)
In [8]:
outcome = df['affluent']

8. Pick at least 5 other variables to include as features in your model. Create a features variable, making sure to create dummy variables for any categorical data you're including...add as many code cells as needed below¶

In [9]:
numericFeatures = df[['pct_college', 'median_age', 'diversity', 'population', 'unemployment', 'is_metro']]
dummies = pd.get_dummies(df[['region']], drop_first=True)
features = pd.concat([numericFeatures, dummies], axis=1)
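For reference, `drop_first=True` drops the alphabetically first category, which becomes the baseline that the remaining dummies are measured against. A toy example (made-up rows):

```python
import pandas as pd

toy = pd.DataFrame({"region": ["Southeast", "West", "Northeast", "West"]})
d = pd.get_dummies(toy[["region"]], drop_first=True)

# 'Northeast' (first alphabetically) is dropped and serves as the baseline
print(list(d.columns))                        # ['region_Southeast', 'region_West']
print(d["region_West"].astype(int).tolist())  # [0, 1, 0, 1]
```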

9. Partition your data into training and test data¶

In [10]:
from sklearn.model_selection import train_test_split
featuresTrain, featuresTest, outcomeTrain, outcomeTest = train_test_split(features, 
                                                                          outcome, 
                                                                          test_size = 0.33, 
                                                                          random_state = 42)
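Since only about a quarter of counties are affluent, one optional refinement (not used above) is `stratify=outcome`, which preserves the class ratio in both partitions. A toy demonstration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"x": range(100)})
y = pd.Series([True] * 25 + [False] * 75)   # 25% positive, like the affluent flag

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=42)
print(ytr.mean(), yte.mean())  # both exactly 0.25
```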

10. Build 1 predictive machine learning model (Decision Tree, Random Forest, or Support Vector Machine) predicting your outcome variable from the features you've selected. If your outcome variable is categorical, you should use one of the Classifier models covered in the Week 8 hands-on videos. If your outcome variable is numeric, you should use a Regressor model...refer to the ML_RegressorModels.ipynb file in Jupyter Hub for that code¶

In [11]:
import sklearn.ensemble
import sklearn.metrics  # classification_report / confusion_matrix used below

# 1. Build the model
modelForest = sklearn.ensemble.RandomForestClassifier(random_state = 42)

# 2. Fit the model using the training data
resultForest = modelForest.fit(featuresTrain, outcomeTrain)

# 3. Predict outcomes from the training and testing data
predForestTrain = modelForest.predict(featuresTrain)
predForestTest = modelForest.predict(featuresTest)

# 4. Assess the fit
print(sklearn.metrics.classification_report(outcomeTrain, predForestTrain))
cmForestTest = sklearn.metrics.confusion_matrix(outcomeTest, predForestTest)
sklearn.metrics.ConfusionMatrixDisplay(cmForestTest, display_labels = modelForest.classes_).plot()
print(sklearn.metrics.classification_report(outcomeTest, predForestTest))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      1574
        True       1.00      1.00      1.00       531

    accuracy                           1.00      2105
   macro avg       1.00      1.00      1.00      2105
weighted avg       1.00      1.00      1.00      2105

Out[11]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x77feaaa63a40>
              precision    recall  f1-score   support

       False       0.88      0.94      0.91       782
        True       0.76      0.61      0.68       255

    accuracy                           0.86      1037
   macro avg       0.82      0.77      0.79      1037
weighted avg       0.85      0.86      0.85      1037

[Figure: confusion matrix of test-set predictions, affluent vs. non-affluent]
In [12]:
!pip install plotly
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: plotly in /home/jupyter-welchcoo/.local/lib/python3.12/site-packages (6.5.2)
Requirement already satisfied: narwhals>=1.15.1 in /home/jupyter-welchcoo/.local/lib/python3.12/site-packages (from plotly) (2.16.0)
Requirement already satisfied: packaging in /opt/tljh/user/lib/python3.12/site-packages (from plotly) (24.1)
In [13]:
# Normalize each component to 0-1
income_score = (df['median_income'] / df['median_income'].max()).clip(0, 1)
education_score = (df['pct_college'] / df['pct_college'].max()).clip(0, 1)
diversity_score = (df['diversity'] / df['diversity'].max()).clip(0, 1)
size_score = (np.log1p(df['population']) / np.log1p(df['population'].max())).clip(0, 1)
age_score = ((50 - df['median_age']).clip(0, 20) / 20)  # younger counties score higher

# Weighted sum
df['opportunity_score'] = (
    income_score * 35 +
    education_score * 25 +
    diversity_score * 15 +
    size_score * 15 +
    age_score * 10
).round(1)
In [14]:
from plotnine import *
breaks = np.arange(0, 160000, 20000)
labels = ['$' + str(int(x/1000)) + 'k' for x in breaks]

a = (ggplot(df, aes(x='median_income', y='opportunity_score', size='population', color='population'))
 + geom_point(alpha=0.3)
 + labs(title='Opportunity Increases with Median Income',
        x='Median Household Income',
        y='Opportunity Score',
        size = 'Population')
 + theme_minimal()
 + scale_x_continuous(breaks = breaks, labels = labels)
)

b = (ggplot(df, aes(x='region', y='opportunity_score', fill='region'))
 + geom_boxplot()
 + labs(title='Northeast and West Lead in Opportunity Score',
        x='Region',
        y='Opportunity Score')
 + theme_minimal()
)

c = (ggplot(df, aes(x='median_age', y='opportunity_score', color='affluent', size='population'))
 + geom_point(alpha=0.5)
 + labs(title='High Opportunity Scores Peak in Affluent Counties',
        x='Median Age',
        y='Opportunity Score',
        color='')
 + scale_color_manual(labels = ["Non-affluent", "Affluent"],
                       values = ["lightgreen", "darkgreen"])
 + theme_minimal()
)

# Get top 100 counties
top100 = df.nlargest(100, 'opportunity_score')
top1000 = df.nlargest(1000, 'opportunity_score')

d = (ggplot(top100, aes(x='region', fill='region'))
 + geom_bar()
 + labs(title='Southeast Has The Most Counties in the Top 100',
        x='',
        y='Counties in Top 100',
        fill = '')
 + theme_minimal()
 + theme(legend_position='none')
)

# Count by state and get top 15 states
state_counts = top1000['state'].value_counts().head(15).reset_index()
state_counts.columns = ['state', 'count']

e = (ggplot(state_counts, aes(x='reorder(state, count)', y='count', fill='state'))
 + geom_bar(stat='identity')
 + coord_flip()
 + labs(title='Texas and Virginia Lead in High-Opportunity Counties',
        x='',
        y='Counties in Top 1000')
 + theme_minimal()
 + theme(legend_position='none')
)


#ggsave(a, "OppInc.png")
#ggsave(b, "RegOpp.png")
#ggsave(c, "OppAff.png")
#ggsave(d, "top100.png")
#ggsave(e, "top1000.png")
a
b
c
d
e
Out[14]:
[Figure a: opportunity score vs. median household income, points sized and colored by population]
Out[14]:
[Figure b: boxplots of opportunity score by region]
Out[14]:
[Figure c: opportunity score vs. median age, affluent vs. non-affluent counties]
Out[14]:
[Figure d: counties in the top 100, by region]
Out[14]:
[Figure e: counties in the top 1000 for the leading 15 states]
In [15]:
df['high_opportunity'] = df['opportunity_score'] >= 50

f = (ggplot(df, aes(x='region', fill='high_opportunity'))
 + geom_bar(position='fill')
 + labs(title='Concentration of High Opportunity Counties',
        x='',
        y='Proportion',
        fill='High Opportunity')
 + scale_fill_manual(values=['lightgreen', 'darkgreen'],
                     labels=['No', 'Yes'])
 + theme_minimal()
)

ggsave(f, "topprop.png")
f
/opt/tljh/user/lib/python3.12/site-packages/plotnine/ggplot.py:623: PlotnineWarning: Saving 6.4 x 4.8 in image.
/opt/tljh/user/lib/python3.12/site-packages/plotnine/ggplot.py:624: PlotnineWarning: Filename: topprop.png
Out[15]:
[Figure: stacked bars of the proportion of high-opportunity counties by region]
In [16]:
# Create a flag for counties that are both affluent AND high diversity
df['both'] = (df['affluent'] == True) & (df['diversity'] >= df['diversity'].median())

g = (ggplot(df, aes(x='median_income', y='diversity', color='both'))
 + geom_point(alpha=0.5)
 + labs(title='The Intersection of Affluence and Diversity',
        x='Median Income',
        y='Diversity Index',
        color='High on Both')
 + scale_color_manual(values=['gray', 'blue'],
                      labels=['No', 'Yes (13.6%)'])
 + theme_minimal()
)

ggsave(g, "AffDiv.png")
g
/opt/tljh/user/lib/python3.12/site-packages/plotnine/ggplot.py:623: PlotnineWarning: Saving 6.4 x 4.8 in image.
/opt/tljh/user/lib/python3.12/site-packages/plotnine/ggplot.py:624: PlotnineWarning: Filename: AffDiv.png
Out[16]:
[Figure: diversity index vs. median income, with counties high on both highlighted]
In [17]:
df.nlargest(10, 'opportunity_score')[['state', 'county', 'opportunity_score', 'median_income', 'pct_college', 'diversity', 'population']]
Out[17]:
state county opportunity_score median_income pct_college diversity population
2872 Virginia Loudoun County 79.1 142299 61.3 35.0 395134
2826 Virginia Arlington County 77.4 120071 75.3 28.5 233464
228 California Santa Clara County 75.9 124055 52.4 55.5 1927470
2848 Virginia Fairfax County 75.6 124831 61.6 38.8 1145862
2925 Virginia Falls Church city 74.1 127610 77.6 20.6 14128
1205 Maryland Howard County 74.0 121160 62.6 43.0 318855
223 California San Francisco County 73.3 112449 58.1 53.6 874961
226 California San Mateo County 72.0 122641 51.0 49.4 767423
1207 Maryland Montgomery County 71.3 108820 58.9 46.9 1043530
319 District of Columbia District of Columbia 69.7 86420 58.5 58.7 692683
In [18]:
import plotly.express as px
import plotly.io as pio

pio.renderers.default = 'notebook'
state_avg = df.groupby('state')['opportunity_score'].mean().reset_index()
state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
    'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
    'District of Columbia': 'DC'
}

state_avg['state_abbrev'] = state_avg['state'].map(state_abbrev)

fig = px.choropleth(state_avg,
                    locations='state_abbrev',
                    locationmode='USA-states',
                    color='opportunity_score',
                    scope='usa',
                    title='Coast to Coast Opportunities',
                    color_continuous_scale='RdYlGn',
                    range_color=[10, state_avg['opportunity_score'].max()],
                    labels={'opportunity_score': 'Opportunity Score'})
fig.update_layout(
    width=1000,
    height=600)
fig.show()

11. Assess the strength of your model on the training vs. test data. For classifier models, you should assess the models using the accuracy %. For regressor models, you should use root mean squared error (RMSE) as shown in the ML_RegressorModels.ipynb file referenced above.¶

Write a few sentences quantifying your model's strength...in your explanation, be sure to address whether or not you see evidence of overfitting¶

The random forest model is 86% accurate on the test data. Since that is 14 percentage points below its 100% accuracy on the training data, there is clearly some overfitting. Looking at the confusion matrix, the model predicts non-affluent counties (94% recall) better than affluent ones (61% recall).
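One common remedy worth trying before the final memo (sketched here on synthetic data, since it was not part of this run) is constraining tree growth, which typically narrows the train/test gap:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a ~75/25 class balance like the affluent flag
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.75],
                           random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=42)

gaps = {}
for name, model in [
    ("default", RandomForestClassifier(random_state=42)),
    ("pruned", RandomForestClassifier(max_depth=5, min_samples_leaf=10,
                                      random_state=42)),
]:
    model.fit(Xtr, ytr)
    gaps[name] = (accuracy_score(ytr, model.predict(Xtr))
                  - accuracy_score(yte, model.predict(Xte)))
print(gaps)  # the pruned forest's train/test gap is smaller
```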

Next Steps¶

12. Based on your model results, what do you think your next steps would be? Does the model seem strong enough to implement as is? Does it make sense to try other models? Do you think other data might strengthen your models if you had access to it?¶

Note: you do not need to create additional models for this assignment. But, prior to submitting your final memo, you should try out other models to see if you can build a stronger model.¶

Despite the overfitting discussed above, the 86% test accuracy means the model does a reasonably good job of predicting which counties have median incomes in the top quartile from other demographic features. It could also make sense to try a regression model, using median_income itself as the outcome variable. In addition, more detailed demographic data (post-graduate education rates, median home values, school district ratings) would almost certainly strengthen the model.
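If I do try the regression route, the assessment switches from accuracy to RMSE. A sketch on synthetic data (the real version would predict median_income from the same county features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=6, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=42)

reg = RandomForestRegressor(random_state=42).fit(Xtr, ytr)
rmse = float(np.sqrt(mean_squared_error(yte, reg.predict(Xte))))
print(round(rmse, 1))  # typical size of a prediction error, in y's units
```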

Once you've finished, save and export your notebook as an HTML file and upload it to the Canvas assignment.¶

From the File menu, click "Save and Export Notebook As" and then select HTML.¶